Loading the required packages
library(tidyverse)
library(dplyr)
library(ggplot2)
Import processed data, which can be found here.
#read preprocessed data
wines <- read.csv(file = '../data/processed_data/wines.csv')
Get sample of dataset
#set seed value to birthday of Ricardo Rodriguez, American wrestler and ring announcer and Dr. Reinaldo (Rei) Sanchez-Arias
set.seed(19630217)
#set percentage to test with for simplicity, if needed
percentage <- 5
wine_sample<- sample_n(wines, percentage/100*nrow(wines))
Wines are normally classified in categories as found on the website. To create a more rich dataset we added the field rating_category determined as:
| Category | Rating | Description |
|---|---|---|
| Classic | 98-100 | The pinnacle of quality. |
| Superb | 94-97 | A great achievement. |
| Excellent | 90-93 | Highly recommended. |
| Very Good | 87-89 | Often good value; well recommended. |
| Good | 83-86 | Suitable for everyday consumption; often good value. |
| Acceptable | 80-82 | Can be employed in casual, less-critical circumstances |
Also, this allows to take a quantitative data and turn it into qualitative data, which can be used for data visualization later.
# function to add rating
rating_category <- function(points){
if(points>=98){
return("Classic")
}
else if (points>=94){
return("Superb")
}
else if(points>=90){
return("Excellent")
}
else if(points>=87){
return("Very Good")
}
else if(points>=83){
return("Good")
}
else{
return("Acceptable")
}
}
wines<- wines %>%
rowwise() %>%
mutate(rating_category = rating_category(points))
head(wines)
To make our dataframes more managable we split reduntant information about the tasters into a new dataframe.
tasters <- wines %>%
select(taster_name, taster_twitter_handle) %>%
unique()
tasters
Drop taster_twitter_handle in wines dataframe
wines <- wines %>%
select(-taster_twitter_handle)
head(wines)
Each reviewer has there own bias. In order to offset that we made a “profile” for each reviewer, This profile allows us to later normalize the wine points for more robust apples to apples comparison. Each reviewers profile includes additional characteristics like:
avg_points which is the avgerage of all the reviewer’s scores
sd_points which is the standard deviation of all the reviewer’s scores
var_points which is the variance of all the reviewer’s scores
reviews which is the number of reviews conducted
taster_rating_profile <- wines %>%
group_by(taster_name) %>%
summarize(
avg_points = mean(points),
sd_points = sd(points),
var_points = var(points),
reviews = n()
)
Grouping rowwise data frame strips rowwise nature
tasters <- inner_join(tasters, taster_rating_profile, by = "taster_name")
head(tasters)
Since, each reviewer has a different bias we created a normalized metric, norm_points, by looking at the number of standard deviatioins a wine is from the reviewer’s avg_points. This gives use a more accurate representation of which which wines are “better” than the rest.
normalize_points <- function(data){
left_join(data, tasters, by = "taster_name")%>%
rowwise() %>%
mutate(norm_points = (points-avg_points)/sd_points) %>%
select(-avg_points, -sd_points, -var_points, -taster_twitter_handle, -reviews)
}
wines <- normalize_points(wines)
head(wines)
Here we do hard checks to clean data to ensure integrity of our data.
Vintage seems to have year 7200, so we filtered all data up to 2019
wines <- wines %>%
filter(vintage<2019)
Before, conducting any detailed analysis of our dataset, we looked at a quick summary of the dataset
summary(wines)
title
Château Lestage Simon 2012 Haut-Médoc : 4
Château Maine-Gazin 1997 Bordeaux : 4
Concannon 2009 Selected Vineyards, Central Coast Pinot Noir (Livermore Valley): 4
Domaine Vacheron 2015 Sancerre : 4
Vignerons des Pierres Dorées 2015 Salamandre d'Or (Coteaux Bourguignons) : 4
Adega Cooperativa de Borba 2009 Senses Syrah (Alentejano) : 3
(Other) :117855
alcohol category vintage designation
Min. : 2.20 Dessert : 1648 Min. :1150 :35285
1st Qu.: 13.00 Fortified : 60 1st Qu.:2009 Reserve : 1924
Median : 13.53 Port/Sherry: 386 Median :2011 Estate : 1236
Mean : 13.84 Red :72354 Mean :2010 Reserva : 1214
3rd Qu.: 14.40 Rose : 4203 3rd Qu.:2013 Riserva : 669
Max. :8333.00 Sparkling : 2273 Max. :2017 Estate Grown: 584
NA's :9060 White :36954 (Other) :76966
country province region
US :50228 California :33418 :19360
France :19164 Washington : 8106 Napa Valley : 4234
Italy :17497 Bordeaux : 5665 Columbia Valley (WA): 3855
Spain : 5890 Tuscany : 5647 Russian River Valley: 2896
Portugal: 4905 Oregon : 4824 California : 2200
Chile : 4235 Northern Spain: 3712 Paso Robles : 2193
(Other) :15959 (Other) :56506 (Other) :83140
subregion variety
:70775 Pinot Noir :12071
Central Coast :10277 Chardonnay :10813
Sonoma : 8527 Cabernet Sauvignon : 8933
Columbia Valley : 7617 Red Blend : 8233
Napa : 6450 Bordeaux-style Red Blend: 6564
Willamette Valley: 3286 Riesling : 4893
(Other) :10946 (Other) :66371
winery price points
DFJ Vinhos : 208 Min. : 4.00 Min. : 80.00
Williams Selyem : 207 1st Qu.: 17.00 1st Qu.: 86.00
Testarossa : 205 Median : 25.00 Median : 88.00
Louis Latour : 193 Mean : 35.48 Mean : 88.47
Chateau Ste. Michelle: 192 3rd Qu.: 42.00 3rd Qu.: 91.00
Georges Duboeuf : 192 Max. :3300.00 Max. :100.00
(Other) :116681 NA's :7974
taster_name rating_category norm_points
:23945 Length:117878 Min. :-4.49580
Roger Voss :22265 Class :character 1st Qu.:-0.72711
Michael Schachner:13867 Mode :character Median : 0.03980
Kerin O’Keefe : 9747 Mean : 0.01183
Virginie Boone : 8926 3rd Qu.: 0.70027
Paul Gregutt : 8747 Max. : 4.46378
(Other) :30381
To better understand the distribution of our data we did some simple univariate visualization based on certain fields. Additionally, before doing a multivariate analysis and answering our research questions we first want to ensure our dataset is robust and an accurate representation of the real world.
The visualization below depicts the distribution of our dataset based on alcohol percentage, alcohol . To better understand and visualize the data we categorized the graph based on rating_category. Notice, a majority of wines have an alcohol amount between 12%-15% and according to Real Simple wine alcholic content averages between 11%-13%. This leads us to believe our data is an accurate representation of the real world.
wines %>%
group_by(alcohol) %>%
ggplot() +
geom_histogram(
mapping = aes(
x = alcohol,
fill = rating_category),
na.rm = TRUE,
bins = 50) +
scale_x_continuous(
breaks = seq(0,25,1),
limits = c(4,22)) +
labs(
title = "Distribution of Alcohol Percentage",
x = "Alcohol Percentage",
y = "Count",
fill = "Rating Category"
)
Grouping rowwise data frame strips rowwise nature
Next, we wanted to see what vintage most of the wines in the dataset were. Again to better understand and visualize the data we categorized the graph based on rating_category. Notice, that there is roughly, an equal percentage of each rating category per vintage.
(Note: Data points before 1990 have been omitted for clarity in visualization)
wines %>%
ggplot() +
geom_bar(
mapping = aes(
x=vintage,
fill = rating_category),
na.rm = TRUE) +
scale_x_continuous(
breaks = seq(1990,2019,5),
limits = c(1990,2019)) +
labs(
title = "Distribution of Vintage",
x = "Vintage",
y = "Count",
fill = "Category")
To better understand the number wines per winery, we did a visualization that counts the number of wines per winery showing only Top 10 winerys to give you an idea what winery has the most selction of wines. Notice, each of the top 10 producers of wine have over 100 different wine labels.
wines %>%
group_by(winery) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10) %>%
ggplot() +
geom_col(
mapping = aes(
x= reorder(winery, count),
y = count,
fill = winery)) +
labs(
title = "Distribution of Winery (Top 10)",
x = "Winery",
y = "Count") +
theme(legend.position = "none") +
coord_flip()
Grouping rowwise data frame strips rowwise nature
To better understand the number wines per province, we did a visualization that counts the number of wines per province showing only the top 10 provinces with the most wines. This can give the reader an idea where their wine will most likely be made with California standing out as a clear leader.
wines %>%
group_by(province) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10) %>%
ggplot()+
geom_col(
mapping = aes(
x = reorder(province, count),
y = count,
fill = province)) +
labs(
title = "Distribution of Province (Top 10)",
x = "Province",
y = "Count") +
theme(legend.position = "none") +
coord_flip()
Grouping rowwise data frame strips rowwise nature
Next, we wanted to see how
Next, we wanted to visualize the distribution of price in our dataset. To better understand and visualize the data we categorized the graph based on rating_category. Notice, a majority of wines are $50 and below with the most common being between $12 - $25. Again, this accurately represents the real world as stated by Vivino the average price for good bottle of red/white wine is ~$15 and ~$28 for a very good bottle. (CAUTION: The Vivino prices denoted were simply an average for red/white wines average costs. This was done to simply generalize the information to do a simple comparsion. Also, this limited to red/white wine and does not accurately include other types)
(Note: Data points above $400 have been omitted for clarity in visualization)
wines %>%
filter(price < 400) %>%
ggplot() +
geom_histogram(
mapping = aes(
x=price,
fill = rating_category),
binwidth = 15) +
labs(
title = "Distribution of Price",
x = "Price",
y = "Count",
fill = "Category")
Next, we wanted to visualize the distribution of points in our dataset. Notice, here that a majority of wines recieve a score between 87 and 90. Which is accurate to the information provided on Wine Searcher which states 50% of the scores fall between 86-90 point from Wine Enthusiast ratings.
(Note: We our dataset was reterived from the Wine Enthusiast website)
wines %>%
ggplot() +
geom_histogram(
mapping = aes(x=points),
bins = 20)+
labs(
title = "Distribution of Points",
x = "Points",
y = "Count")
Now, that we have a better understanding of our data and we know it is a accuare representation of the real world we can preform an more detailed anaylsis using multiple variables.
To understand the point distribution by tasters, we did a multivarite visualization that coorelates taster names based on the average wine points as identified by the x-intercept. This gives the reader an idea of how some reviewers correlate to the overall average.
(Note: The “blank” represents unknown reviewers. We assumed the reviewers not named have not rated a signficant amount of wines and can be grouped into a singular reviewer)
wines %>%
ggplot() +
geom_boxplot(
mapping = aes(
x=taster_name,
y=points,
color = taster_name)) +
geom_hline(yintercept = mean(wines$points)) +
theme(legend.position = "none")+
labs(
title = "Points by Taster",
x="Taster Name",
y="Points"
)+
coord_flip()
To understand the price distribution by points, we did a multivarite visualization that creates a scatter plot of the wines based on points and price. Then, we added a smooth transformation on identify trends. Notice, the data is “stacked” and the socres range from 80-100
wines %>%
ggplot() +
geom_point(
mapping = aes(
x = points,
y = price,
color = category),
na.rm = TRUE,
alpha = 0.2) +
labs(
title = "Price by Points",
x = "Points",
y = "Price",
fill = "Wine Category") +
geom_smooth(
mapping = aes(
x = points,
y = price),
na.rm = TRUE)
Since, there are multiple outliers and the visualization is clustered. By preforming a log on all the prices we can reduce the skewness of the visualization. Notice, as quality of wine increases price increases exponentially.
wines %>%
ggplot() +
geom_point(
mapping = aes(
x = points,
y = log(price),
color = category),
na.rm = TRUE,
alpha = 0.2) +
labs(
title = "log(Price) by Points",
x = "Points",
y = "log(Price)",
color = "Wine Category") +
geom_smooth(
mapping = aes(
x = points,
y = log(price)),
na.rm = TRUE)
Next, our group wanted to do a more granular analysis by looking at how the price varies by points grouped by wine category. Notice, all the price go up as points go up, but the growth rates are different per wine category.
wines %>%
ggplot() +
geom_point(
mapping = aes(
x = points,
y = log(price),
color = category),
alpha =0.2,
na.rm = TRUE) +
geom_smooth(
mapping = aes(
x = points,
y = log(price)),
na.rm = TRUE) +
facet_wrap(~category) +
labs(
title = "log(Price) by Points",
x = "Points",
y = "log(Price)")+
theme_minimal()+
theme(legend.position = "none" )
To determine the best province for wine by points, we averaged the points of all wines per province with a sample size greater than 30 and returned the top 10 with standard error. Standard error helps us determine the spread of the dataset. The graph below shows top 10 best provinces by average points with respect to standard error.
TODO EXPLAIN OSAKI To determine the best province for wine by points we average all the wines per province and return the top 10 with standard error.
To determine the best variety of wine we use the average point of all wines per variety with a sample size greater than 30. The graph below shows the the top 10 varieties with their respective standard error.
TODO IZZY
A easy way to determine the best wine is by simply finding the top 10 wines.
However, this does not account for the graders bias. Instead, our group “normalized” the points based on each taster based on the number of standard deviations an wines is from the raters average. For example, Taster A could give a wine 100 but has an avgerage rating score of 95 with a standard deviation of 5. Whereas, Taster B could give a wine 91 and have an average score of 87 with a standard deviation of 2. Although, the wine tasted by Taster A got a perfect 100 score, Taster B’s wine was much “better” wine since it was 2 standard deviations from the tasters avgerage compared to 1 standard deviation of the other wine.
Looking at the norm_points these are the top 10 best wines
A simple value metric we can use to determine best value is points/price
However, again this metric is not normalized. Instead, norm_points/price would yield more robust results.